1. INTRODUCTION

The purpose of this short paper is to share a recent obser- 
vation I made in the context of my introductory graduate 
course on MapReduce at the University of Maryland. It is 
well known that since the sort/shuffle stage in MapReduce 
is costly, local aggregation is one important principle for de-
signing efficient algorithms. This typically involves using
combiners or the so-called in-mapper combining technique.
However, can we be more precise in formulating this design 
principle for pedagogical purposes? Simply saying "use com- 
biners" or "use in-mapper combining" is unsatisfying because 
it leaves open the obvious question of how? What follows 
is my attempt to formulate a more precise design principle 
in terms of monoids — the idea is quite simple, but I haven't 
seen anyone else make this observation before in the context 
of MapReduce. 

Let me illustrate with a running example I often use to 
illustrate MapReduce algorithm design, which is detailed in 
Lin and Dyer [5]. Given a large number of key-value pairs 
where the keys are strings and the values are integers, we 
wish to find the average of all the values by key. In SQL,
this is accomplished with a simple group-by and AVG. Here
is the naive MapReduce algorithm:

Algorithm 1
class Mapper
    method MAP(string t, integer r)
        EMIT(t, r)

class Reducer
    method REDUCE(string t, integers [r1, r2, ...])
        sum ← 0
        cnt ← 0
        for all r ∈ [r1, r2, ...] do
            sum ← sum + r
            cnt ← cnt + 1
        ravg ← sum/cnt
        EMIT(t, ravg)

This isn't a particularly efficient algorithm because the 
mappers do no work and all data are shuffled (across the net- 
work) over to the reducers. Furthermore, the reducer cannot 
be used as a combiner. Consider what would happen if we 
did: the combiner would compute the mean of an arbitrary 
subset of values with the same key, and the reducer would
compute the mean of those partial means. As a concrete example,
we know that: 

Avg(1, 2, 3, 4, 5) ≠ Avg(Avg(1, 2), Avg(3, 4, 5))



In general, the mean of means of arbitrary subsets of a set 
of values is not the same as the mean of the set of values. 
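
The inequality above is easy to check numerically. The following Python sketch is only an illustration; the helper name `avg` is an assumption and does not come from the original algorithms.

```python
def avg(xs):
    # Arithmetic mean of a non-empty list of numbers.
    return sum(xs) / len(xs)

values = [1, 2, 3, 4, 5]

# Mean over the full set of values.
true_mean = avg(values)

# Mean of the means of two arbitrary subsets, which is what we would get
# if the reducer of Algorithm 1 were reused as a combiner.
mean_of_means = avg([avg([1, 2]), avg([3, 4, 5])])

print(true_mean)      # 3.0
print(mean_of_means)  # 2.75
```

The two results differ because the subsets have different sizes, and the second-level mean forgets how many values each partial mean summarizes.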

So how might we properly take advantage of combiners?
An attempt is shown in Algorithm 2.

Algorithm 2
class Mapper
    method MAP(string t, integer r)
        EMIT(string t, integer r)

class Combiner
    method COMBINE(string t, integers [r1, r2, ...])
        sum ← 0
        cnt ← 0
        for all integer r ∈ integers [r1, r2, ...] do
            sum ← sum + r
            cnt ← cnt + 1
        EMIT(string t, pair (sum, cnt))

class Reducer
    method REDUCE(string t, pairs [(s1, c1), (s2, c2), ...])
        sum ← 0
        cnt ← 0
        for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2), ...] do
            sum ← sum + s
            cnt ← cnt + c
        ravg ← sum/cnt
        EMIT(string t, integer ravg)



The mapper remains the same, but we have added a com- 
biner that partially aggregates results by separately tracking 
the numeric components necessary to arrive at the mean. 
The combiner receives each string and the associated list of 
integers, from which it computes the sum of those values 
and the number of integers encountered (i.e., the count). 
The sum and count are packaged into a pair and emitted 
as the output of the combiner, with the same string as the 
key. In the reducer, pairs of partial sums and counts can be 
aggregated to arrive at the mean. 

The problem with this algorithm is that it doesn't actu- 
ally work. Combiners must have the same input and output 
key-value type, which also must be the same as the map- 
per output type and the reducer input type. This is clearly 
not the case. To understand why this restriction is neces- 
sary, remember that combiners are optimizations that can- 
not change the correctness of the algorithm. So let us remove 
the combiner and see what happens: the output value type of 
the mapper is integer, so the reducer should receive a list of 
integers. But the reducer actually expects a list of pairs! The
correctness of the algorithm is contingent on the combiner
running on the output of the mappers, and more specifi- 
cally, that the combiner is run exactly once. Hadoop, for 
example, makes no guarantees on how many times combin- 
ers are called; it could be zero, one, or multiple times. This 
algorithm violates the MapReduce programming model. 
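
To make the failure concrete, here is a small Python simulation of Algorithm 2 for a single key. It is a sketch under simplifying assumptions: there is no real MapReduce framework involved, and the function names `algo2_combiner` and `algo2_reducer` are hypothetical stand-ins for the pseudocode classes.

```python
def algo2_combiner(values):
    # Algorithm 2's combiner: takes integers, emits one (sum, count) pair.
    return [(sum(values), len(values))]

def algo2_reducer(values):
    # Algorithm 2's reducer: expects a list of (sum, count) pairs.
    total = 0
    cnt = 0
    for s, c in values:
        total += s
        cnt += c
    return total / cnt

mapper_output = [1, 2, 3, 4, 5]  # all mapper values for one key

# Combiner runs exactly once: the types happen to line up.
once = algo2_reducer(algo2_combiner(mapper_output))

# Combiner runs zero times: the reducer receives bare integers.
try:
    algo2_reducer(mapper_output)
    zero_times_ok = True
except TypeError:
    zero_times_ok = False

# Combiner runs twice: the second pass receives pairs, not integers.
try:
    algo2_reducer(algo2_combiner(algo2_combiner(mapper_output)))
    twice_ok = True
except TypeError:
    twice_ok = False

print(once, zero_times_ok, twice_ok)  # 3.0 False False
```

Only the "exactly once" schedule succeeds, which is precisely the guarantee that the framework does not give.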
Another stab at the solution is shown in Algorithm 3.



Algorithm 3
class Mapper
    method MAP(string t, integer r)
        EMIT(t, (r, 1))

class Combiner
    method COMBINE(string t, pairs [(s1, c1), (s2, c2), ...])
        sum ← 0
        cnt ← 0
        for all (s, c) ∈ [(s1, c1), (s2, c2), ...] do
            sum ← sum + s
            cnt ← cnt + c
        EMIT(t, (sum, cnt))

class Reducer
    method REDUCE(string t, pairs [(s1, c1), (s2, c2), ...])
        sum ← 0
        cnt ← 0
        for all (s, c) ∈ [(s1, c1), (s2, c2), ...] do
            sum ← sum + s
            cnt ← cnt + c
        ravg ← sum/cnt
        EMIT(t, ravg)





The algorithm is now correct. In the mapper we emit as 
the intermediate value a pair consisting of the integer and 
one — this corresponds to a partial count over one instance. 
The combiner separately aggregates the partial sums and 
the partial counts (as before), and emits pairs with updated
sums and counts. The reducer is similar to the combiner, 
except that the mean is computed at the end. In essence, 
this algorithm transforms a non-associative operation (mean 
of values) into an associative operation (element-wise sum 
of a pair of numbers, with a division at the end). 
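
The robustness of Algorithm 3 can be illustrated with a short Python sketch for a single key. The function names (`map_value`, `combine`, `reduce_mean`) are assumptions made for this example; the point is only that the final mean is the same whether the combiner runs zero, one, or several times.

```python
def map_value(r):
    # Mapper: wrap each integer as a (sum, count) pair with count 1.
    return (r, 1)

def combine(pairs):
    # Combiner: element-wise sum of (sum, count) pairs; pairs in, pair out.
    s = sum(p[0] for p in pairs)
    c = sum(p[1] for p in pairs)
    return (s, c)

def reduce_mean(pairs):
    # Reducer: same aggregation as the combiner, then one final division.
    s, c = combine(pairs)
    return s / c

mapped = [map_value(r) for r in [1, 2, 3, 4, 5]]

# Combiner never runs.
no_combiner = reduce_mean(mapped)

# Combiner runs once on two arbitrary subsets.
one_pass = reduce_mean([combine(mapped[:2]), combine(mapped[2:])])

# Combiner runs repeatedly, including on its own output.
nested = combine([combine(mapped[:2]), combine(mapped[2:4])])
multi_pass = reduce_mean([nested, mapped[4]])

print(no_combiner, one_pass, multi_pass)  # 3.0 3.0 3.0
```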

Finally, Algorithm 4 shows an even more efficient algo-
rithm that exploits the in-mapper combining pattern:



Algorithm 4
class Mapper
    method INITIALIZE()
        S ← new AssociativeArray
        C ← new AssociativeArray
    method MAP(string t, integer r)
        S{t} ← S{t} + r
        C{t} ← C{t} + 1
    method CLOSE()
        for all term t ∈ S do
            EMIT(term t, pair (S{t}, C{t}))



Inside the mapper, the partial sums and counts associ- 
ated with each string are held in memory across input key- 
value pairs. Intermediate key-value pairs are emitted only 
after the entire input split has been processed; similar to 
before, the value is a pair consisting of the sum and count. 
The reducer is exactly the same as in Algorithm 3. Moving
partial aggregation from the combiner into the mapper
assumes that the intermediate data structures will fit into 
memory, which may not be a valid assumption. However, in 
cases where the assumption holds, the in-mapper combin- 
ing technique can be substantially faster than using normal 
combiners, primarily due to the savings in not needing to 
materialize intermediate key-value pairs.
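
The in-mapper combining pattern can be sketched in Python as follows. This is a minimal single-process illustration, not Hadoop code: the class name `InMapperCombiningMapper` and the way `map` and `close` are driven by hand are assumptions made for the example.

```python
from collections import defaultdict

class InMapperCombiningMapper:
    def __init__(self):
        # State held in memory across all input key-value pairs.
        self.sums = defaultdict(int)
        self.counts = defaultdict(int)

    def map(self, key, value):
        # Nothing is emitted here; only the in-memory state is updated.
        self.sums[key] += value
        self.counts[key] += 1

    def close(self):
        # Emit one (sum, count) pair per key after the whole split is read.
        for key in self.sums:
            yield key, (self.sums[key], self.counts[key])

mapper = InMapperCombiningMapper()
for key, value in [("a", 1), ("b", 10), ("a", 3), ("b", 20), ("a", 5)]:
    mapper.map(key, value)

emitted = dict(mapper.close())
print(emitted)  # {'a': (9, 3), 'b': (30, 2)}
```

Five input records produce only two intermediate pairs, at the cost of keeping one entry per distinct key in memory.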

2. MONOIDIFY! 

Okay, what have we done to make this particular algo- 
rithm work? The answer is that we've created a monoid out 
of the intermediate value! 

How so? First, a recap on monoids: a monoid is an alge-
braic structure with a single associative binary operation
and an identity element. As a simple example, the natural
numbers form a monoid under addition with the identity
element 0. Applied to our running example, it's now ev-
ident that the intermediate value in Algorithm 3 forms a
monoid: the set of all tuples of non-negative integers with
the identity element (0, 0) and the element-wise sum opera-
tion, (a, b) ⊕ (c, d) = (a + c, b + d).
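
As an informal check of the definition, the Python sketch below spot-checks the identity and associativity laws for the (sum, count) pairs on a handful of sample values. It is not a proof, and the names `IDENTITY`, `combine`, and `samples` are assumptions introduced for this example.

```python
from functools import reduce
from itertools import product

IDENTITY = (0, 0)

def combine(x, y):
    # Element-wise sum: (a, b) combined with (c, d) gives (a + c, b + d).
    return (x[0] + y[0], x[1] + y[1])

samples = [(0, 0), (1, 1), (7, 2), (12, 5)]

# Identity law: combining with (0, 0) on either side changes nothing.
identity_holds = all(
    combine(IDENTITY, x) == x == combine(x, IDENTITY) for x in samples
)

# Associativity law: grouping does not affect the result.
associativity_holds = all(
    combine(combine(x, y), z) == combine(x, combine(y, z))
    for x, y, z in product(samples, repeat=3)
)

# Folding a list of mapper outputs, starting from the identity.
total = reduce(combine, [(1, 1), (2, 1), (3, 1)], IDENTITY)

print(identity_holds, associativity_holds, total)  # True True (6, 3)
```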

Thus, one principle for designing efficient MapReduce al- 
gorithms can be precisely articulated as follows: create a 
monoid out of the intermediate value emitted by the mapper. 
Once we "monoidify" the object, proper use of combiners
and the in-mapper combining technique becomes straight-
forward. This principle also explains why the reducer in
Algorithm 1 cannot be used as a combiner and why Algo-
rithm 2 doesn't work.

A few remarks. In many cases the operation is commutative
as well, so we actually have a commutative monoid, although
in this paper I won't focus on this distinction (i.e., in many
places where I refer to a monoid, to be more precise it's
actually a commutative monoid). In Algorithm 4, the elements
of the tuple have been pulled apart and stored in separate
data structures, but that's a specific implementation choice
not germane to the design principle. Finally, this exposition
glosses over the fact that at the end of the computation, we
break apart the pair to arrive at the mean, which destroys
the monoid, but this is a one-time termination operation that
can be treated as "post-processing".

3. OTHER EXAMPLES

The "monoidify" principle readily explains another Map-
Reduce algorithm I often use for pedagogical purposes: the
problem of building word co-occurrence matrices from large
natural language corpora, a common task in corpus linguis-
tics and statistical natural language processing. Formally,
the co-occurrence matrix of a corpus is a square n × n ma-
trix where n is the number of unique words in the corpus
(i.e., the vocabulary size). A cell m_ij contains the number
of times word w_i co-occurs with word w_j within a specific
context: a natural unit such as a sentence, paragraph, or
a document, or a certain window of m words (where m is
an application-dependent parameter). Note that the up-
per and lower triangles of the matrix are identical since co-
occurrence is a symmetric relation, though in the general
case relations between words need not be symmetric. For
example, a co-occurrence matrix M where m_ij is the count
of how many times word i was immediately succeeded by
word j (i.e., bigrams) would not be symmetric. Beyond
simple co-occurrence counts, the MapReduce algorithm for
this task extends readily to computing relative frequencies
and forms the basis of more sophisticated algorithms such
as those for expectation-maximization (where we're keeping
track of pseudo-counts rather than actual observed counts).
The so-called "stripes" algorithm [5] for accomplishing the
co-occurrence computation is as follows:

Algorithm 5
class Mapper
    method MAP(docid a, doc d)
        for all term w ∈ doc d do
            H ← new AssociativeArray
            for all term u ∈ NEIGHBORS(w) do
                H{u} ← H{u} + 1
            EMIT(term w, stripe H)

class Reducer
    method REDUCE(term w, stripes [H1, H2, H3, ...])
        Hf ← new AssociativeArray
        for all stripe H ∈ stripes [H1, H2, H3, ...] do
            SUM(Hf, H)
        EMIT(term w, stripe Hf)

In this case, the reducer can also be used as a combiner 
because associative arrays form a monoid under the opera- 
tion of element-wise sum with the empty associative array 
as the identity element. 
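
The stripe monoid can be sketched in Python using `collections.Counter` as the associative array. This is an illustrative assumption rather than the representation used in the original algorithm, and the helper name `stripe_sum` and the sample stripes are made up for the example.

```python
from collections import Counter
from functools import reduce

def stripe_sum(h1, h2):
    # Element-wise sum of two stripes, returned as a new Counter.
    result = Counter(h1)
    result.update(h2)
    return result

EMPTY = Counter()  # identity element: the empty associative array

# Three stripes emitted by mappers for the same term w.
stripes = [
    Counter({"cat": 1, "dog": 2}),
    Counter({"dog": 1, "fish": 3}),
    Counter({"cat": 4}),
]

# Reducer only: sum all stripes at once.
reduced_directly = reduce(stripe_sum, stripes, EMPTY)

# Combiner first (on the first two stripes), then the reducer.
partial = reduce(stripe_sum, stripes[:2], EMPTY)
reduced_after_combine = reduce(stripe_sum, [partial, stripes[2]], EMPTY)

print(dict(reduced_directly))  # {'cat': 5, 'dog': 3, 'fish': 3}
```

Because the same function serves as both combiner and reducer, intermediate grouping does not change the final stripe.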

Here's another non-trivial example: Lin and Kolcz [6] ad-
vocate the use of stochastic gradient descent (SGD) for scal-
ing out the training of classifiers. Viewed from this per-
spective, SGD "works" because the model parameters (i.e., a
weight vector for linear models) form a monoid under
incremental training.

Other examples of interesting monoids that are useful
for large-scale data processing are found in Twitter's Al-
gebird package. These include Bloom filters [1], count-min
sketches [2], and HyperLogLog counters [4].

Finally, it is interesting to note that regular languages
form a monoid under intersection, union, and concatenation
(they are also closed under subtraction, although subtraction
is not associative). Since finite-state techniques are widely used
in computational linguistics and natural language process-
ing, this observation might hold implications for scaling out
text processing applications.

4. OPTIMIZATIONS AND BEYOND 

In the context of MapReduce, it may be possible to elevate 
"monoidification" from a design principle (that requires man- 
ual effort by a developer) to an automatic optimization that 
can be mechanistically applied. For example, in Hadoop, 
one can imagine declaring Java objects as monoids (for ex- 
ample, via an interface). When these objects are used as 
intermediate values in a MapReduce algorithm, some opti- 
mization layer can automatically create combiners (or apply 
in-mapper combining) as appropriate. 
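
The idea of declaring intermediate values as monoids can be sketched as follows. The text imagines a Java interface in Hadoop; this sketch uses Python and is purely hypothetical: `Monoid`, `SumCount`, and `auto_combine` are invented names, not part of Hadoop or any existing library.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from functools import reduce

class Monoid(ABC):
    # Hypothetical interface a framework could ask value types to implement.
    @classmethod
    @abstractmethod
    def identity(cls):
        """Return the identity element."""

    @abstractmethod
    def combine(self, other):
        """Return the associative combination of self and other."""

@dataclass(frozen=True)
class SumCount(Monoid):
    total: int = 0
    count: int = 0

    @classmethod
    def identity(cls):
        return cls(0, 0)

    def combine(self, other):
        return SumCount(self.total + other.total, self.count + other.count)

def auto_combine(monoid_cls, values):
    # Generic combiner that an optimization layer could derive automatically
    # for any Monoid subclass, with no algorithm-specific code.
    return reduce(lambda a, b: a.combine(b), values, monoid_cls.identity())

partial = auto_combine(SumCount, [SumCount(1, 1), SumCount(2, 1)])
final = auto_combine(SumCount, [partial, SumCount(3, 1)])
print(final, final.total / final.count)  # SumCount(total=6, count=3) 2.0
```

The design choice here is that the developer writes only `identity` and `combine`; the same derived function can then serve as a combiner, as a reducer body, or as the update step for in-mapper combining.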

The observation that monoids represent a design principle 
for efficient MapReduce algorithms extends more broadly to 
large-scale data processing in general. One concrete example 
is Twitter's Summingbird project, which takes advantage 
of associativity to integrate real-time and batch processing. 
The same monoid (from Algebird, mentioned above) can be 
used to hold state in a low-latency online application (i.e., 
operating on an infinite stream) as well as in a scale-out 
batch processing job (e.g., on Hadoop). 
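
The following Python sketch illustrates, in miniature, how one monoid can serve both settings. It does not use or imitate the Summingbird or Algebird APIs; the names `combine`, `IDENTITY`, and `events` are assumptions, and `itertools.accumulate` with the `initial` argument requires Python 3.8 or later.

```python
from functools import reduce
from itertools import accumulate

IDENTITY = (0, 0)

def combine(x, y):
    # The (sum, count) monoid from the running example.
    return (x[0] + y[0], x[1] + y[1])

events = [(4, 1), (6, 1), (2, 1)]

# Batch: fold the whole dataset at once.
batch_result = reduce(combine, events, IDENTITY)

# Online: keep running state, updated as each event arrives.
stream_states = list(accumulate(events, combine, initial=IDENTITY))

print(batch_result)   # (12, 3)
print(stream_states)  # [(0, 0), (4, 1), (10, 2), (12, 3)]
```

The last streaming state equals the batch result, because both are folds of the same associative operation over the same events.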



5. CONCLUSIONS 

None of the ideas in this paper are completely novel: the 
property of associativity and commutativity in enabling com- 
biners to work properly was pointed out in the original 
MapReduce paper [3]. Independently, there has been a re- 
cent resurgence of interest in functional programming and 
its theoretical underpinnings in category theory. However, 
I haven't seen anyone draw the connection between Map- 
Reduce algorithm design and monoids in the way that I have 
articulated here — and therein lies the small contribution of 
this piece: identifying a phenomenon and giving it a name. 

However, it remains to be seen whether this observation 
is actually useful. Perhaps I am gratuitously introducing 
monoids just because category theory is "hip" and in vogue. 
In a way, a monoid is simply a convenient shorthand for 
saying: associative operations give an execution framework 
great flexibility in sequencing computations, thus allowing op-
portunities for much more efficient execution. Thus, another
way to phrase the takeaway lesson is: take advantage of as- 
sociativity (and commutativity) to the greatest possible ex- 
tent. This rephrasing conveys the gist without needing to 
invoke references to algebraic structures. 

Finally, there remains the question of whether this obser- 
vation is actually useful as a pedagogical tool for teaching 
students how to think in MapReduce (which was the origi- 
nal motivation for this paper). It is often the case that in- 
troducing additional layers of abstraction actually confuses 
students more than it clarifies (especially in light of the pre-
vious paragraph). This remains an empirical question I hope
to explore in future offerings of my MapReduce course. 

To conclude, the point of this paper can be summed up 
in a pithy directive: Go forth and monoidify! 